What’s The Story With HNSW?
by Ryan McDermott

February 25, 2024
in Data Science & ML
Exploring the path to fast nearest neighbour search with Hierarchical Navigable Small Worlds


Hierarchical Navigable Small World (HNSW) has become popular as one of the best performing approaches for approximate nearest neighbour search. HNSW is a little complex though, and descriptions often lack a complete and intuitive explanation. This post takes a journey through the history of the HNSW idea to help explain what “hierarchical navigable small world” actually means and why it’s effective.

Approximate Nearest Neighbour Search

Small Worlds

Navigable Small Worlds

Hierarchical Navigable Small Worlds

Summary

Appendix

  • Improved search
  • HNSW search & insertion
  • Improved insertion

References

A common application of machine learning is nearest neighbour search, which means finding the most similar items* to a target — for example, to recommend items that are similar to a user’s preferences, or to search for items that are similar to a user’s search query.

The simple method is to calculate the similarity of every item to the target and return the closest ones. However, if there are a large number of items (maybe millions), this will be slow.
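
As a concrete baseline, here is a minimal sketch of this simple method in Python with NumPy. The dataset size, dimensionality, and function name are illustrative, not from the original post:

```python
import numpy as np

def brute_force_search(vectors: np.ndarray, target: np.ndarray, k: int = 5) -> np.ndarray:
    """Return the indices of the k vectors closest to the target.

    Every vector's distance to the target is computed, so the cost
    grows linearly with the number of vectors.
    """
    distances = np.linalg.norm(vectors - target, axis=1)  # one distance per vector
    return np.argsort(distances)[:k]                      # indices of the k smallest

# Illustrative usage: one million random 64-dimensional vectors.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(1_000_000, 64))
target = rng.normal(size=64)
print(brute_force_search(vectors, target))
```

This exhaustive scan is the baseline that an index aims to beat.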

Instead, we can use a structure called an index to make things much faster.

There is a trade-off, however. Unlike the simple method, indexes only give approximate results: we may not retrieve all of the nearest neighbours (i.e. recall may be less than 100%).

There are several different types of index (e.g. locality sensitive hashing; inverted file index), but HNSW has proven particularly effective on various datasets, achieving high speeds while keeping high recall.

*Typically, items are represented as embeddings, which are vectors produced by a machine learning model; the similarity between items corresponds to the distance between the embeddings. This post will usually talk of vectors and distances, though in general HNSW can handle any kind of items with some measure of similarity.

Illustration of the small-world experiment.

Small worlds were famously studied in Stanley Milgram’s small-world experiment [1]. Participants were given a letter containing the address and other basic details of a randomly chosen target individual, along with the experiment’s instructions. In the unlikely event that they personally knew the target, they were instructed to send them the letter; otherwise, they were told to think of someone they knew who was more likely to know the target, and send the letter on to them.

The surprising conclusion was that the letters were typically only sent around six times before reaching the target, demonstrating the famous idea of “six degrees of separation” — any two people can usually be connected by a small chain of friends.

In the mathematical field of graph theory, a graph is a set of points, some of which are connected. We can think of a social network as a graph, with people as points and friendships as connections. The small-world experiment found that most pairs of points in this graph are connected by short paths that have a small number of steps. (This is described technically as the graph having a low diameter.)

Illustration of a small world. Most connections (grey) are local, but there are also long-range connections (green), which create short paths between points, such as the three step path between points A and B indicated with arrows.

Having short paths is not that surprising in itself: most graphs have this property, including graphs created by just connecting random pairs of points. But social networks are not connected randomly; they are highly local: friends tend to live close to each other, and if you know two people, it’s quite likely they know each other too. (This is described technically as the graph having a high clustering coefficient.) The surprising thing about the small-world experiment is that two distant points are separated by only a short path despite connections typically being short-range.

In cases like these when a graph has lots of local connections, but also has short paths, we say the graph is a small world.

Another good example of a small world is the global airport network. Airports in the same region are highly connected to one another, but it’s possible to make a long journey in only a few stops by making use of major hub airports. For example, a journey from Manchester, UK to Osaka, Japan typically starts with a local flight from Manchester to London, then a long distance flight from London to Tokyo, and finally another local flight from Tokyo to Osaka. Long-range hubs are a common way of achieving the small world property.

A final interesting example of graphs with the small world property is biological neural networks such as the human brain.

In small world graphs, we can quickly reach a target in a few steps. This suggests a promising idea for nearest neighbour search: perhaps if we create connections between our vectors in such a way that it forms a small world graph, we can quickly find the vectors near a target by starting from an arbitrary “entry point” vector and then navigating through the graph towards the target.

This possibility was explored by Kleinberg [2]. He noted that the existence of short paths wasn’t the only interesting thing about Milgram’s experiment: it was also surprising that people were able to find these short paths without using any global knowledge about the graph. Rather, they were following a simple greedy algorithm. At each step, they examined each of their immediate connections and sent the letter to the one they thought was closest to the target. We can use a similar algorithm to search a graph that connects vectors.

Illustration of the greedy search algorithm. We are searching for the vector that is nearest the target X. Starting at the entry point E, we check the distance to X of each vector connected to E (indicated by the arrows from E), and go to the closest one (indicated by the red arrow from E). We repeat this procedure at successive vectors until we reach Y. As Y has no connections that are closer to X than Y itself, we stop and return Y.
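
In code, the greedy search is only a few lines. The sketch below assumes the graph is stored as an adjacency list mapping each vector’s index to the indices of its connections; the names are illustrative, not from the post:

```python
import numpy as np

def greedy_search(vectors: np.ndarray, neighbours: dict, entry: int, target: np.ndarray) -> int:
    """Walk the graph greedily from `entry` towards `target`.

    At each step, move to whichever connection of the current vector
    is closest to the target; stop when none is an improvement.
    """
    current = entry
    current_dist = np.linalg.norm(vectors[current] - target)
    while True:
        best, best_dist = current, current_dist
        for n in neighbours[current]:
            d = np.linalg.norm(vectors[n] - target)
            if d < best_dist:
                best, best_dist = n, d
        if best == current:  # no connection is closer: a local minimum, like Y above
            return current
        current, current_dist = best, best_dist
```

Note that the result is approximate: the search stops at a local minimum, which is close to the target but not necessarily its true nearest neighbour.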

Kleinberg wanted to know whether this greedy algorithm would always find a short path. He ran simple simulations of small worlds in which all of the points were connected to their immediate neighbours, with additional longer connections created between random points. He discovered that the greedy algorithm would only find a short path in specific conditions, depending on the lengths of the long-range connections.

If the long-range connections were too long (as was the case when they connected pairs of points in completely random locations), the greedy algorithm could follow a long-range connection to quickly reach the rough area of the target, but after that the long-range connections were of no use, and the path had to step through the local connections to get closer. On the other hand, if the long-range connections were too short, it would simply take too many steps to reach the area of the target.

If, however, the lengths of the long-range connections were just right (to be precise, if they were uniformly distributed, so that all lengths were equally likely), the greedy algorithm would typically reach the neighbourhood of the target in an especially small number of steps (to be more specific, a number proportional to log(n), where n is the number of points in the graph).
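
A toy version of such a simulation is straightforward to write. The sketch below uses a one-dimensional ring of points rather than Kleinberg’s two-dimensional lattice, and all names and parameters are mine; it builds the graph, then counts the greedy steps between two points:

```python
import random

def build_ring_small_world(n: int, sample_length) -> dict:
    """Ring of n points, each connected to its immediate neighbours,
    plus one long-range connection per point with a sampled length."""
    neighbours = {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}
    for i in range(n):
        j = (i + sample_length(n)) % n
        neighbours[i].add(j)
        neighbours[j].add(i)
    return neighbours

def greedy_hops(neighbours: dict, n: int, start: int, target: int) -> int:
    """Count greedy steps from start to target using ring distance."""
    dist = lambda a, b: min((a - b) % n, (b - a) % n)
    hops, current = 0, start
    while current != target:
        # An immediate neighbour is always strictly closer, so this terminates.
        current = min(neighbours[current], key=lambda v: dist(v, target))
        hops += 1
    return hops

n = 10_000
graph = build_ring_small_world(n, lambda n: random.randint(1, n // 2))  # uniform lengths
print(greedy_hops(graph, n, 0, n // 2))
```

Swapping in different `sample_length` distributions makes it easy to compare how the number of greedy hops grows with the size of the graph.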

In cases like this where the greedy algorithm can find the target in a small number of steps, we say the small world is a navigable small world (NSW).

An NSW sounds like an ideal index for our vectors, but for vectors in a complex, high-dimensional space, it’s not clear how to actually build one. Fortunately, Malkov et al. [3] discovered a method: we insert one randomly chosen vector at a time into the graph and connect it to its m nearest neighbours among the vectors already inserted, where m is a small constant.

Illustration of building an NSW. Vectors are inserted in a random order and connected to the nearest m = 2 inserted vectors. Note how the first vectors to be inserted form long-range connections while later vectors form local connections.

This method is remarkably simple and requires no global understanding of how the vectors are distributed in space. It’s also very efficient, as we can use the graph built so far to perform the nearest neighbour search for inserting each vector.
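
A sketch of this build procedure is below. For clarity it finds each new vector’s m nearest inserted neighbours by exact search; as just noted, a real implementation would reuse the partially built graph for that search instead. The function name is mine:

```python
import numpy as np

def build_nsw(vectors: np.ndarray, m: int = 2) -> dict:
    """Build an NSW graph: insert vectors in random order, connecting
    each to its m nearest neighbours among those already inserted."""
    neighbours = {i: set() for i in range(len(vectors))}
    inserted = []
    for i in np.random.permutation(len(vectors)):
        if inserted:
            # Exact search for clarity; a real build would search the
            # graph built so far instead.
            dists = np.linalg.norm(vectors[inserted] - vectors[i], axis=1)
            for j in np.argsort(dists)[:m]:
                neighbours[i].add(inserted[j])
                neighbours[inserted[j]].add(i)
        inserted.append(i)
    return neighbours
```

Once built, the graph can be searched with the greedy procedure sketched earlier, starting from any inserted vector as the entry point.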

Experiments confirmed that this method produces an NSW. Because the vectors inserted early on…





